STAT 679: Problem Set #4

Q1. Political Book Recommendations

In this problem, we’ll study a network dataset of Amazon bestselling US Politics books. Books are linked by an edge if they appeared together in the recommendations (“customers who bought this book also bought these other books”).

Part (a)

The code below reads in the edges and nodes associated with the network. The edges data set only contains IDs of co-recommended books, while the nodes data includes attributes associated with each book. Build a tbl_graph object to store the graph.

# Packages used throughout this problem set
library(tidyverse)
library(tidygraph)
library(ggraph)
library(tidytext)
library(topicmodels)
library(tidymodels)
library(glue)

edge_data_path <- "../data/political-books-edges.csv"
node_data_path <- "../data/political-books-nodes.csv"
edges <- read_csv(edge_data_path, col_types = "cci")
nodes <- read_csv(node_data_path, col_types = "ccc")

books_graph <- tbl_graph(nodes, edges, directed = FALSE)
books_graph
## # A tbl_graph: 105 nodes and 441 edges
## #
## # An undirected simple graph with 1 component
## #
## # A tibble: 105 × 3
##   id    label                      political_ideology
##   <chr> <chr>                      <chr>             
## 1 0     1000 Years for Revenge     neutral           
## 2 1     Bush vs. the Beltway       conservative      
## 3 2     Charlie Wilson's War       conservative      
## 4 3     Losing Bin Laden           conservative      
## 5 4     Sleeping With the Devil    neutral           
## 6 5     The Man Who Warned America conservative      
## # ℹ 99 more rows
## #
## # A tibble: 441 × 3
##    from    to weight
##   <int> <int>  <int>
## 1     1     2      1
## 2     1     3      1
## 3     1     4      1
## # ℹ 438 more rows
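As a sanity check on the `tbl_graph` API used above, the sketch below builds a hypothetical four-book graph (all values made up) and computes each node's degree, i.e., the number of co-recommendation edges a book participates in:

```r
library(tidygraph)
library(dplyr)

# Hypothetical mini-version of the books data (values made up)
toy_nodes <- tibble(id = as.character(0:3),
                    label = c("A", "B", "C", "D"),
                    political_ideology = c("neutral", "conservative",
                                           "liberal", "neutral"))
toy_edges <- tibble(from = c(1, 1, 2, 3), to = c(2, 3, 3, 4))

toy_graph <- tbl_graph(toy_nodes, toy_edges, directed = FALSE)

# Degree centrality: how many co-recommendations each book appears in
toy_degrees <- toy_graph %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree()) %>%
  as_tibble()
toy_degrees
```

The same `activate(nodes) %>% mutate(...)` pattern applies unchanged to `books_graph`, e.g. to size nodes by degree in the plots below.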

Part (b)

Use the result from part (a) to visualize the network as a node-link diagram. Include the book titles as node labels, and shade each node according to political ideology.

ggraph(books_graph, layout = "kk") +
  geom_edge_link(width = 0.05) +
  geom_node_point(aes(col = political_ideology), size = 3) +
  geom_node_label(aes(label = label), repel = TRUE, label.size = 0.1) +
  labs(col = "Political Ideology")

Part (c)

Create the analogous adjacency matrix visualization. Provide examples of visual queries that are easy to answer using one encoding but not the other (i.e., what is easy to see in the node-link view vs. what is easy to see in the adjacency matrix).

ggraph(books_graph, layout = "matrix") +
  geom_edge_tile(show.legend = FALSE, mirror = TRUE) +
  geom_node_point(aes(col = political_ideology), x = -1, size = 1) +
  geom_node_text(aes(label = label, col = political_ideology),
                 y = -1, size = 2, angle = 90, hjust = 1) +
  labs(col = "Political Ideology")

In the node-link view, it is easy to trace paths between books and to read the overall cluster structure: the conservative and liberal books largely form two dense communities connected through a handful of neutral titles, so queries like "which books bridge the two ideological clusters?" are easy to answer. In the adjacency matrix, checking whether two specific books are co-recommended is a single cell lookup with no edge crossings or occlusion, and dense blocks along the diagonal make within-ideology connection density easy to compare; both of these queries become tedious in the node-link view once edges overlap.

Q2. Topics in Pride and Prejudice

This problem uses LDA to analyze the full text of Pride and Prejudice. The object paragraphs is a data.frame whose rows are paragraphs from the book. Very short paragraphs (e.g., from dialogue) have been filtered out. We are interested, for example, in how the topics appearing in the book vary from the start to the end of the book.

paragraphs <- read_csv("../data/paragraphs.csv")
paragraphs
## # A tibble: 1,092 × 2
##    text                                                                paragraph
##    <chr>                                                                   <dbl>
##  1 "however little known the feelings or views of such a man may be o…         1
##  2 "\"why, my dear, you must know, mrs. long says that netherfield is…         2
##  3 "\"i see no occasion for that. you and the girls may go, or you ma…         3
##  4 "\"my dear, you flatter me. i certainly _have_ had my share of bea…         4
##  5 "\"but consider your daughters. only think what an establishment i…         5
##  6 "\"you are over-scrupulous, surely. i dare say mr. bingley will be…         6
##  7 "\"i desire you will do no such thing. lizzy is not a bit better t…         7
##  8 "mr. bennet was so odd a mixture of quick parts, sarcastic humour,…         8
##  9 "mr. bennet was among the earliest of those who waited on mr. bing…         9
## 10 "\"i honour your circumspection. a fortnight's acquaintance is cer…        10
## # ℹ 1,082 more rows

Part (a)

Create a Document-Term Matrix containing word counts across the same paragraphs. That is, the i-th row of dtm should correspond to the i-th row of paragraphs. Make sure to remove all stop words.

by_paragraph <- paragraphs %>%
  unite(document, paragraph)  # treat each paragraph id as a document id

word_counts <- by_paragraph %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(document, word) %>%
  arrange(-n)

dtm <- word_counts %>%
  cast_dtm(document, word, n)

dtm
## <<DocumentTermMatrix (documents: 1092, terms: 5655)>>
## Non-/sparse entries: 30289/6144971
## Sparsity           : 100%
## Maximal term length: 17
## Weighting          : term frequency (tf)
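One detail worth checking here is row order: `cast_dtm()` keeps rows in order of first appearance in the counts table, so after `arrange(-n)` the DTM rows may not follow the paragraph order. A minimal sketch on made-up paragraphs (hypothetical data, stop-word filtering omitted since no toy word is a stop word) shows how to inspect the row names:

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(tm)  # provides the DocumentTermMatrix class behind cast_dtm()

# Hypothetical mini-paragraphs (values made up)
toy <- tibble(text = c("cat", "dog dog dog", "fish fish"),
              paragraph = 1:3)

toy_dtm <- toy %>%
  unite(document, paragraph) %>%
  unnest_tokens(word, text) %>%
  count(document, word) %>%
  arrange(-n) %>%          # same arrange(-n) as in the pipeline above
  cast_dtm(document, word, n)

# Row names identify which paragraph each DTM row came from
rownames(toy_dtm)
```

Because the row names carry the paragraph ids, downstream joins (e.g., with the `gamma` table in part (d)) stay unambiguous even if the row order is scrambled.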

Part (b)

Fit an LDA model to dtm using 6 topics. Set the seed by using the argument control = list(seed=479) to remove any randomness in the result.

paragraphs_lda <- LDA(dtm, k = 6, control = list(seed = 479))

topics <- tidy(paragraphs_lda, matrix = "beta") %>% arrange(-beta)
memberships <- tidy(paragraphs_lda, matrix = "gamma")

Part (c)

Visualize the top 30 words within each of the fitted topics. Specifically, create a faceted bar chart where the lengths of the bars correspond to word probabilities and the facets correspond to topics. Reorder the bars so that each topic’s top words are displayed in order of decreasing probability.

top_words <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 30) %>%
  mutate(term = reorder_within(term, beta, topic))

ggplot(top_words) + 
  geom_col(aes(beta, term, fill=factor(topic)), show.legend = F) + 
  facet_wrap(~ topic, scales = "free") + 
  labs(title="Beta matrix of top 30 words in each topic", y="", x="") +
  scale_fill_brewer(palette = "Set2") +
  scale_y_reordered() + 
  scale_x_continuous(expand = c(0,0))

Part (d)

Find the paragraph that is the purest representative of Topic 2. That is, if \(\gamma_{ik}\) denotes the weight of topic \(k\) in paragraph \(i\), print out paragraph \(i^{*} = \arg\max_{i} \gamma_{i2}\). Verify that at least a few of the words with high probability for this topic appear in it. Copy only the first sentence into your solution.

topic2_members <- memberships %>%
  filter(topic == 2)

pure_topic2_doc <- topic2_members %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  pull(document)

pure_topic2_para <- by_paragraph %>%
  filter(document == pure_topic2_doc) %>%
  pull(text)

paste0("Topic 2 is purely represented by paragraph #",pure_topic2_doc)
## [1] "Topic 2 is purely represented by paragraph #546"
str_extract(pure_topic2_para, "^[^.!?]*[.!?]")
## [1] "\"i had not been long in hertfordshire, before i saw, in common with others, that bingley preferred your elder sister to any other young woman in the country."
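To make the verification concrete, the sketch below checks which of a topic's high-probability terms occur in the retrieved paragraph. The term list here is hypothetical; in the real solution it would come from the `topics` table filtered to `topic == 2`.

```r
library(stringr)

# Hypothetical high-probability terms for topic 2 (stand-ins, not fitted values)
top_topic2_terms <- c("bingley", "sister", "jane")

para <- paste("i had not been long in hertfordshire, before i saw, in common",
              "with others, that bingley preferred your elder sister to any",
              "other young woman in the country.")

# TRUE where the term appears verbatim in the paragraph text
hits <- sapply(top_topic2_terms, function(w) str_detect(para, fixed(w)))
hits
```

Seeing most of these flags come back TRUE is the kind of check part (d) asks for.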

Q3. Food nutrients

This problem will use PCA to provide a low-dimensional view of a 14-dimensional nutritional facts data set. The data were originally curated by the USDA and are regularly used in visualization studies.

nutrients <- read_csv("../data/nutrients.csv") %>% select(-id, -group_lumped)

Part (a)

Define a tidymodels recipe that normalizes all nutrient features and specifies that PCA should be performed.

pca_rec <- recipe(~., data = nutrients) %>%
  update_role(name:group, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

pca_prep <- prep(pca_rec)
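Before examining the components, it can help to check how much variance they capture. The tidy method for `step_pca` supports `type = "variance"`; here is a minimal sketch on made-up numeric data (hypothetical columns a–c standing in for the nutrient features):

```r
library(recipes)
library(dplyr)

set.seed(479)
# Hypothetical numeric data standing in for the nutrient features
toy <- data.frame(a = rnorm(50), b = rnorm(50))
toy$c <- toy$a + rnorm(50, sd = 0.1)  # correlated column so PC1 dominates

variance <- recipe(~ ., data = toy) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors()) %>%
  prep() %>%
  tidy(number = 2, type = "variance")

# Percent of total variance captured by each principal component
variance %>% filter(terms == "percent variance")
```

Running the same call on `pca_prep` (whose PCA is also step number 2) would show how much of the 14-dimensional nutrient variation the leading components retain.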

Part (b)

Visualize the top 6 principal components. What types of food do you expect to have low or high values for PC1 or PC2?

I expect Fats and Oils to have the lowest values for PC1, and Sweets to have the highest values for PC2.

components <- tidy(pca_prep, 2)

ggplot(components %>% filter(component %in% glue("PC{1:6}"))) +
  geom_col(aes(value, terms), show.legend = FALSE) +
  facet_wrap(~ component) +
  labs(title = "Top 6 principal components of food nutrients",
       x = "Component Value", y = "")

Part (c)

Compute the average value of PC2 within each category of the group column. Give the names of the groups sorted by this average.

sample_scores <- bake(pca_prep, NULL)

PC2_groups <- sample_scores %>% 
  group_by(group) %>% 
  summarize(avg_PC2 = mean(PC2)) %>% 
  arrange(-avg_PC2) %>% 
  pull(group)

The groups sorted (highest -> lowest) by average value of PC2 are:

Spices and Herbs; Breakfast Cereals; Sweets; Snacks; Cereal Grains and Pasta; Baked Products; Beverages; Fruits and Fruit Juices; Legumes and Legume Products; Baby Foods; Meals, Entrees, and Sidedishes; Vegetables and Vegetable Products; Fast Foods; Dairy and Egg Products; Restaurant Foods; Nut and Seed Products; Soups, Sauces, and Gravies; Ethnic Foods; Finfish and Shellfish Products; Poultry Products; Pork Products; Beef Products; Sausages and Luncheon Meats; Lamb, Veal, and Game Products; Fats and Oils

Part (d)

Visualize the scores of each food item with respect to the first two principal components. Facet the visualization according to the group column, and sort the facets according to the results of part (c). How does the result compare with your guess from part (b)?

  • From the plot below, I can confirm that my guess about Fats and Oils having a low PC1 was accurate.
  • Spices and Herbs turns out to have a higher PC2 value than Sweets, contrary to my guess.

ggplot(sample_scores) +
  geom_hline(yintercept = 0, size = 0.5, col = "#5d5d5d") +
  geom_vline(xintercept = 0, size = 0.5, col = "#5d5d5d") +
  geom_point(aes(PC1, PC2), size=0.3, alpha=0.4) +
  facet_wrap(~ factor(group, levels = PC2_groups)) + 
  labs(title="Score of food items w.r.t. principal components")

Q4. Interactive Phylogeny

We will build an interactive phylogenetic tree of SARS-CoV-2 genetic sequences. Each sequence has been annotated with the date and location of its discovery. We will use D3 to allow readers to explore how genetic changes unfold over time and space. You can find the raw data here: nodes, edges. We have provided starter code to build a d3.stratify() object from the edge data and to define an object, node_lookup, which can be used to look up the country and date associated with the from and to fields in the edges.

Part (a)

Create a static tree visualization that shows how the different COVID variants evolved from one another. Use color to encode the location of the variant’s discovery. You may group rare countries into “Other,” and draw variants with unknown origins using either white or grey.

Code for parts (a) and (b):

Part (b)

Implemented interactivity: as the user hovers near a node, all of its ancestors are highlighted and the remaining nodes are blended into the background.

Part (c)

Propose, but do not implement, an extended version of part (b) that is linked with an additional table or visualization. How would the second graphic be updated in response to user interactions? What additional queries would your proposed visualization make possible?

My proposed extensions to the interaction are:

  • displaying the country name when a leaf node is highlighted (already implemented in the current code), and
  • showing the number of descendant countries when the user hovers over a non-leaf node.